Uncovering the Hidden Structures in Press Releases of Canadian Federal Parties Issued During the 2015 Election using word2vec and t-SNE

The 2015 Canadian federal election is in its final stretch, and at Datacratic we thought it would be a great opportunity to collect and play with some data.

Each party has a different view on what is important to move Canada forward. This has the effect of centering their official message around certain themes. The question we asked is: can we find patterns in the words used in each party’s press releases? Do some deal more with certain themes than others?

The answer is Yes.

This image is a static scatter plot of press releases. Each dot represents a press release, colored with its party’s official color, and press releases are positioned such that similar ones are close together. The axes are unitless.

The labels (and pink dots) represent major themes that we hand-selected and then projected into the 2D space computed by our algorithms, to help interpret the space.

The rest of this notebook is the exact code that was used to generate the scatter plot with MLDB, along with an interactive version of the plot at the bottom. The algorithms used to produce this image include word2vec and t-SNE.

Creating the plot


In [458]:
import json, pandas, requests
from datetime import datetime
import numpy
from pymldb import Connection
mldb = Connection()

Importing the data

We wrote a small scraper using the scrapy Python library. We scraped the press releases of the NDP, the Liberals, the Green Party and the Conservatives. The resulting data was saved in a CSV file with one press release per line.
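
The spider looked roughly like this (a minimal sketch: the URLs and CSS selectors here are hypothetical, and each party's site needed its own):

import scrapy

class PressReleaseSpider(scrapy.Spider):
    name = "press_releases"
    start_urls = ["http://www.example-party.ca/news"]  # hypothetical listing page

    def parse(self, response):
        # follow each press release link found on the listing page
        for href in response.css("a.press-release::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_release)

    def parse_release(self, response):
        # one item per press release, matching the CSV columns used below
        yield {
            "party": "example",
            "date": response.css("span.date::text").extract_first(),
            "title": response.css("h1::text").extract_first(),
            "link": response.url,
            "full_text": " ".join(response.css("div.content ::text").extract()),
        }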

As mentioned above, we added a few lines representing the major themes we wish to project into our space.


In [717]:
ds = mldb.v1.datasets("raw")
ds.put({
    "type": "text.csv.tabular",
    "params": {
        "dataFileUrl":"file:///mldb_data/all_parties_clean.csv",
        "ignoreBadLines": True,
        "rowNamePrefix": "pr"
    }
})


Out[717]:
PUT http://localhost/v1/datasets/raw
201 Created
{
  "status": {
    "num_skipped_lines": 150
  }, 
  "config": {
    "params": {
      "ignoreBadLines": true, 
      "rowNamePrefix": "pr", 
      "dataFileUrl": "file:///mldb_data/all_parties_clean.csv"
    }, 
    "type": "text.csv.tabular", 
    "id": "raw"
  }, 
  "state": "ok", 
  "type": "text.csv.tabular", 
  "id": "raw"
}

In [718]:
mldb.query("select party, count(*) from raw where to_timestamp(date) > to_timestamp('2015-08-01') group by party")


Out[718]:
count(*) party
_rowName
["category"] 17 category
["conservative"] 104 conservative
["green"] 90 green
["liberal"] 264 liberal
["ndp"] 162 ndp

In [719]:
mldb.query("select * from raw limit 1")


Out[719]:
date full_text link party title
_rowName
pr000000000763 2015-06-11T00:00:00 Merci Céline Membres du Conseil d'administrat... http://www.ndp.ca/news/toms-speech-to-board-tr... ndp Tom's speech to the Board of Trade of Metropol...

Creating bags of words

The next step is to take the full_text column and tokenize its contents. We do this because we will need to compare press releases using their content.

The tokenize function splits the column into individual words, removes words shorter than 3 characters, and applies a stoplist.
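
For intuition, here is a rough pure-Python equivalent of what the tokenize call below does (the STOPLIST here is a hypothetical, abbreviated stand-in; MLDB does all of this natively):

import re
from collections import Counter

STOPLIST = set(["the", "and", "for", "that", "with"])  # abbreviated stand-in

def baggify(full_text):
    # split on the same characters as the tokenize() call below
    words = re.split(r'[ ?!;/\[\]*"]+', full_text.lower())
    # drop short words and stopwords, then count occurrences
    return Counter(w for w in words if len(w) >= 3 and w not in STOPLIST)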


In [720]:
bag = mldb.v1.procedures("baggify")
bag.put({
    "type": "transform",
    "params": {
        "inputDataset": "raw",
        "outputDataset": {
            "id": "bag_of_words",
            "type": "beh.mutable"
        },
        "select": """tokenize(full_text, {' ?!;/[]*"' as splitchars, '' as quotechar}) as *""",
        "where": "full_text IS NOT NULL AND title IS NOT NULL AND to_timestamp(date) > to_timestamp('2015-08-01')"
    }
})
mldb.v1.datasets("bag_of_words").delete()
bag.runs.post({})


Out[720]:
POST http://localhost/v1/procedures/baggify/runs
201 Created
{
  "runStarted": "2015-10-16T03:01:44.905Z", 
  "state": "finished", 
  "runFinished": "2015-10-16T03:01:46.383Z", 
  "id": "2015-10-16T03:01:44.901528Z-5bc7042b732cb41f"
}

The result is a clean list of words that will be easy to compare across press releases.


In [721]:
mldb.query("select * from bag_of_words limit 1")


Out[721]:
accessible achieve afford announced balanced based belief budgets canada canadas ... tim today transfer transfers true universally values weaken work wreck
_rowName
pr000000000085 1 1 1 2 1 1 1 1 1 1 ... 1 1 1 2 1 1 1 1 1 1

1 rows × 106 columns

What are the most common words?


In [722]:
df = mldb.query("select sum({*}) as * from bag_of_words")
df2 = df.T
df2.columns = ["count"]

In [723]:
df2.sort(columns="count", ascending=False)[:25]


Out[723]:
count
canadians 1010
canada 918
plan 834
harper 754
mulcair 731
government 730
liberal 688
party 653
canadian 605
trudeau 602
ndp 550
stephen 520
minister 508
economy 498
prime 479
justin 465
green 447
today 390
leader 382
families 372
change 356
conservative 355
canadas 346
jobs 341
middle 305

Use word2vec to embed our bag of words

The word2vec tool, introduced by Mikolov et al., is used to embed words into a high-dimensional space.

Word2vec gives us word embeddings, but since what interests us is press releases, and each one is represented by a bag of words, we actually need document embeddings. This means that for a given press release, we need to embed each of its words and then combine them all to get a representation of the press release in the word2vec space. To achieve this, we simply average each dimension across the press release's word embeddings to get the coordinates for the press release.
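
In numpy terms, the idea is simply this (toy 3-dimensional vectors stand in for the real 300-dimensional GoogleNews embeddings):

import numpy as np

w2v = {  # toy word vectors standing in for the GoogleNews embeddings
    "economy": np.array([0.2, -0.1, 0.4]),
    "jobs":    np.array([0.1, -0.3, 0.5]),
    "plan":    np.array([0.0,  0.2, 0.1]),
}
bag = ["economy", "jobs", "plan"]  # the words of one press release

# the press release's coordinates: the mean over its word vectors
doc_embedding = np.mean([w2v[w] for w in bag], axis=0)
# -> array([ 0.1, -0.06666667, 0.33333333])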

Load the trained word2vec data


In [724]:
mldb.v1.datasets("w2v").delete()
w2v = mldb.v1.procedures("w2vimport")
w2v.put({
    "type": 'import.word2vec',
    "params": {
        "dataFileUrl": 'file:///mldb_data/GoogleNews-vectors-negative300.bin',
        "output": {
            "type": 'embedding',
            "id": 'w2v'
        }
    }
})
w2v.runs.post({})


Out[724]:
POST http://localhost/v1/procedures/w2vimport/runs
201 Created
{
  "runStarted": "2015-10-16T03:01:48.765Z", 
  "state": "finished", 
  "runFinished": "2015-10-16T03:02:57.330Z", 
  "id": "2015-10-16T03:01:48.764986Z-5bc7042b732cb41f"
}

In [725]:
mldb.query("select * from w2v limit 5")


Out[725]:
000000 000001 000002 000003 000004 000005 000006 000007 000008 000009 ... 000290 000291 000292 000293 000294 000295 000296 000297 000298 000299
_rowName
0 0.152344 -0.121094 0.102051 -0.083984 -0.184570 0.015320 0.238281 -0.478516 0.072754 0.218750 ... 0.028320 -0.164062 -0.173828 0.361328 -0.201172 -0.142578 -0.021606 0.013794 -0.057373 0.277344
Pragmatix 0.007721 0.123047 0.028076 -0.157227 -0.034424 -0.053955 0.037598 -0.071777 -0.149414 -0.072754 ... -0.100586 -0.160156 -0.042969 -0.013062 0.075684 -0.002808 0.014893 -0.011414 0.184570 -0.047852
griefing 0.091797 -0.113281 0.029907 0.048584 -0.241211 0.195312 0.034912 0.302734 -0.034424 -0.082520 ... 0.104980 0.233398 -0.118652 -0.089355 -0.142578 0.062988 0.061768 0.054199 -0.225586 0.187500
Richard_Altomare 0.056396 0.455078 -0.189453 -0.055420 0.096191 -0.296875 0.195312 0.240234 0.224609 -0.104004 ... -0.139648 -0.141602 0.003479 0.159180 0.281250 0.010376 -0.174805 0.060547 -0.025269 -0.174805
quantum_physicist 0.227539 -0.076172 0.248047 0.112305 0.004913 0.146484 0.025391 -0.168945 -0.001282 0.034180 ... 0.167969 -0.087402 -0.292969 0.045898 -0.058594 -0.206055 -0.172852 -0.164062 0.094238 0.099609

5 rows × 300 columns

Prepare the datasets

We now have the w2v dataset that contains embeddings for words. We also have the bag_of_words dataset that contains the words present in each press release. To compute the embeddings for the press releases, we need to set up a merged dataset.


In [726]:
ds = mldb.v1.datasets("tranposed_bag_of_words")
ds.put({
    "type": "transposed",
    "params": {
        "dataset": {"id": "bag_of_words"}
    }
})


Out[726]:
PUT http://localhost/v1/datasets/tranposed_bag_of_words
201 Created
{
  "config": {
    "params": {
      "dataset": {
        "id": "bag_of_words"
      }
    }, 
    "type": "transposed", 
    "id": "tranposed_bag_of_words"
  }, 
  "state": "ok", 
  "type": "transposed", 
  "id": "tranposed_bag_of_words"
}

In [727]:
proc = mldb.v1.datasets("bow_merged")
proc.put({
    "type": "merged",
    "params": {
        "datasets": [{"id": "tranposed_bag_of_words"}, {"id": "w2v"}]
    }
})


Out[727]:
PUT http://localhost/v1/datasets/bow_merged
201 Created
{
  "status": [
    null, 
    null
  ], 
  "config": {
    "params": {
      "datasets": [
        {
          "id": "tranposed_bag_of_words"
        }, 
        {
          "id": "w2v"
        }
      ]
    }, 
    "type": "merged", 
    "id": "bow_merged"
  }, 
  "state": "ok", 
  "type": "merged", 
  "id": "bow_merged"
}

In [776]:
rows = requests.get("http://localhost/v1/datasets/bag_of_words/rows").json()
print rows[:5]


[u'pr000000000085', u'pr000000000062', u'pr000000001003', u'pr000000000069', u'pr000000000055']

The bow_merged dataset now contains both the columns representing the embedding of each word and the pr* columns, representing how many times the given word appeared in the given press release.


In [778]:
mldb.query("select * from bow_merged limit 1")


Out[778]:
000000 000001 000002 000003 000004 000005 000006 000007 000008 000009 ... 000290 000291 000292 000293 000294 000295 000296 000297 000298 000299
_rowName
0 0.152344 -0.121094 0.102051 -0.083984 -0.18457 0.01532 0.238281 -0.478516 0.072754 0.21875 ... 0.02832 -0.164062 -0.173828 0.361328 -0.201172 -0.142578 -0.021606 0.013794 -0.057373 0.277344

1 rows × 300 columns

The following query computes the average of the embeddings for the release represented by rows[0]. The >0 condition translates to: "for the words present in the release".


In [777]:
mldb.query("select avg({* EXCLUDING(pr*)}) as * from bow_merged where "+rows[0]+">0")


Out[777]:
000000 000001 000002 000003 000004 000005 000006 000007 000008 000009 ... 000290 000291 000292 000293 000294 000295 000296 000297 000298 000299
_rowName
[] -0.011037 0.041938 0.002756 0.117649 -0.052991 -0.013769 0.042946 -0.062109 0.107114 0.02232 ... -0.061832 0.04426 -0.03532 0.039322 -0.069882 -0.002956 -0.024179 -0.022233 0.051886 0.007178

1 rows × 300 columns

Compute embeddings for all press releases

Using what we saw above, we can now calculate each press release's embedding with the following code. To get embeddings for all releases, we make the call $n$ times, once per press release.


In [729]:
import urllib
import multiprocessing.dummy
def calc_embedding(pr):
    return (pr, mldb.query("select avg({* EXCLUDING(pr*)}) as * from bow_merged where "+pr+">0"))

p = multiprocessing.dummy.Pool(25)
new_res = p.map(calc_embedding, rows)

We can now save the results in a new dataset.


In [771]:
col_names = requests.get("http://localhost/v1/datasets/w2v/columns").json()

ds = mldb.v1.datasets("pr_embed")
ds.put({
    "type":"embedding",
})

for pr, line in new_res:
    # skip press releases for which the embedding query returned nothing
    if len(line) == 0: continue
    row = []
    # build one [column_name, value, timestamp] tuple per dimension
    for col_name, cell in zip(col_names, line):
        row.append([col_name, float(line[cell]), 0])
    try:
        ds.rows.post({
            "rowName": pr,
            "columns": row
        })
    except:
        print pr, len(line)
        break


ds.commit.post({})


Out[771]:
POST http://localhost/v1/datasets/pr_embed/commit
200 OK

In [743]:
mldb.query("select * from pr_embed where rowName() = '"+rows[0]+"'")


Out[743]:
000000 000001 000002 000003 000004 000005 000006 000007 000008 000009 ... 000290 000291 000292 000293 000294 000295 000296 000297 000298 000299
_rowName
pr000000000085 -0.011037 0.041938 0.002756 0.117649 -0.052991 -0.013769 0.042946 -0.062109 0.107114 0.02232 ... -0.061832 0.04426 -0.03532 0.039322 -0.069882 -0.002956 -0.024179 -0.022233 0.051886 0.007178

1 rows × 300 columns

Use t-SNE for dimensionality reduction

t-SNE is a very powerful technique that can be used to reduce the dimensionality of data to 2D or 3D for visualisation. We use it to determine how to lay out the press releases in the scatter plot.
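
For readers without MLDB, the same reduction can be sketched with scikit-learn, assuming X is the (n_releases, 300) numpy matrix of document embeddings from the previous step:

from sklearn.manifold import TSNE

# perplexity matches the value passed to MLDB's tsne.train below
xy = TSNE(n_components=2, perplexity=5).fit_transform(X)
# xy[:, 0] and xy[:, 1] are the x and y scatter plot coordinates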


In [762]:
tsne = mldb.v1.procedures("pr_embed_tsne")
tsne.put({
    "type" : "tsne.train",
    "params" : {
        "dataset" : {"id" : "pr_embed"},
        "output" : {"id" : "pr_embed_tsne", "type" : "embedding"},
        "select" : "*",
        "where" : "true",
        "modelFileUrl": "file:///mldb_data/tsne.bin",
        "functionName": "tsne_embed",
        "perplexity": 5
    }
})

mldb.v1.datasets("pr_embed_tsne").delete()

tsne.runs.post({})


Out[762]:
POST http://localhost/v1/procedures/pr_embed_tsne/runs
201 Created
{
  "runStarted": "2015-10-16T03:48:41.192Z", 
  "state": "finished", 
  "runFinished": "2015-10-16T03:48:43.122Z", 
  "id": "2015-10-16T03:48:41.192231Z-5bc7042b732cb41f"
}

In [763]:
mldb.v1.datasets("pr_embed_tsne_merged").put({
    "type" : "merged",
    "params" : {
        "datasets": [
            {"id": "raw"},
            {"id": "pr_embed_tsne"}
        ]
    }
})


Out[763]:
PUT http://localhost/v1/datasets/pr_embed_tsne_merged
201 Created
{
  "status": [
    {
      "num_skipped_lines": 150
    }, 
    null
  ], 
  "config": {
    "params": {
      "datasets": [
        {
          "id": "raw"
        }, 
        {
          "id": "pr_embed_tsne"
        }
      ]
    }, 
    "type": "merged", 
    "id": "pr_embed_tsne_merged"
  }, 
  "state": "ok", 
  "type": "merged", 
  "id": "pr_embed_tsne_merged"
}

Drawing the scatter plot

We now have the pr_embed_tsne_merged dataset that represents all of our press releases with $(x,y)$ coordinates in the embedding space.


In [764]:
mldb.query("""
select party, title, x, y 
from pr_embed_tsne_merged 
where to_timestamp(date) > to_timestamp('2015-08-01') 
limit 5""")


Out[764]:
party title x y
_rowName
pr000000000085 conservative Harper Commits To New Funding To Continue Terr... 17.528767 61.783394
pr000000000062 conservative STATEMENT: Harper Extends Best Wishes On The 7... -41.850677 -54.289703
pr000000001003 liberal Fact Check: Desperate NDP makes false claims a... -58.890091 31.784502
pr000000000069 conservative Stephen Harper announces plan to save Canadian... -21.604961 58.552570
pr000000000055 conservative PM Harper Announces Plan To Strengthen The Nat... -14.184937 -59.870262

In [765]:
df = mldb.query("""
select party, title, x, y
from pr_embed_tsne_merged
where to_timestamp(date) > to_timestamp('2015-08-01')
""")

In [766]:
import numpy as np
colormap = {
    "ndp": "#FF8000",
    "liberal": "#DF0101",
    "conservative": "#0000FF",
    "green": "#01DF01",
    "category": "#FE2EC8"
}

import bokeh.plotting as bp
from bokeh.models import HoverTool

In [767]:
#this line must be in its own cell 
bp.output_notebook()


BokehJS successfully loaded.

In [773]:
press_releases = np.array([str(x.encode('ascii','ignore').split("|")[0]) for x in list(df.title.values)])
x = bp.figure(plot_width=900, plot_height=700, title="Press Releases of Canadian Federal Parties During 2015 Elections",
       tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
       x_axis_type=None, y_axis_type=None, min_border=1)
x.scatter(
    x = df.x.values, 
    y = df.y.values, 
    color=[colormap[k] for k in df.party.values],
    radius=1,
    fill_alpha=0.5,
    source=bp.ColumnDataSource({"title": press_releases})
).select(dict(type=HoverTool)).tooltips = {"title":"@title"}
bp.show(x)


This plot is interactive. Feel free to explore it by hovering over the different press releases. The tooltip shows the title of each press release.

